Multilingual Language Processing From Bytes
We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads
text as bytes and outputs span annotations of the form [start, length, label]
where start positions, lengths, and labels are separate entries in our
vocabulary. Because we operate directly on Unicode bytes rather than
language-specific words or characters, we can analyze text in many languages
with a single model. Due to the small vocabulary size, these multilingual
models are very compact, but produce results similar to or better than the
state-of-the-art in Part-of-Speech tagging and Named Entity Recognition that
use only the provided training datasets (no external data sources). Our models
are learning "from scratch" in that they do not rely on any elements of the
standard pipeline in Natural Language Processing (including tokenization), and
thus can run in standalone fashion on raw text.
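The [start, length, label] span format described above can be illustrated with a minimal sketch. It is not the authors' model; it only shows how spans index into UTF-8 bytes rather than characters. The example sentence, the labels PER and LOC, and the token names such as "S0" and "L6" are assumptions made for illustration.

```python
def to_bytes(text):
    """Return the UTF-8 byte values of text, i.e. the model's input symbols."""
    return list(text.encode("utf-8"))


def span_text(text, start, length):
    """Recover the surface string covered by a byte-level span."""
    raw = text.encode("utf-8")
    return raw[start:start + length].decode("utf-8")


def spans_to_tokens(spans):
    """Flatten spans into one output sequence of separate vocabulary entries.

    The "S<n>" / "L<n>" naming is hypothetical; the abstract only states that
    starts, lengths, and labels are separate entries in the vocabulary.
    """
    tokens = []
    for start, length, label in spans:
        tokens.extend([f"S{start}", f"L{length}", label])
    return tokens


text = "Óscar lives in Málaga"
# Byte offsets differ from character offsets because "Ó" and "á" take 2 bytes each.
spans = [(0, 6, "PER"), (16, 7, "LOC")]

for start, length, label in spans:
    print(label, span_text(text, start, length))
print(spans_to_tokens(spans))
```

Because offsets are counted in bytes, the same span vocabulary applies to any language that can be encoded in UTF-8, which is the property the abstract relies on.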
Metal Fluorides as Analogs for Studies on Phosphoryl Transfer Enzymes
The 1994 structure of a transition state analog with AlF4- and GDP complexed to G1, a small G protein, heralded a new field of research into structure and mechanism of enzymes that manipulate transfer of the phosphoryl (PO3-) group. The list of enzyme structures that embrace metal fluorides, MFx, as ligands that imitate either the phosphoryl group or a phosphate, is now growing at over 80 per triennium. They fall into three distinct geometrical classes: (i) Tetrahedral complexes, based on BeF3-, mimic ground state phosphates; (ii) Octahedral complexes, primarily based on AlF4-, mimic the "in-line" anionic transition state for phosphoryl transfer; and (iii) Trigonal bipyramidal complexes, represented by MgF3- and putative AlF3^0 moieties, additionally mimic the tbp geometry of the transition state. The interpretation of these structures provides a deeper mechanistic understanding of the behavior and manipulation of phosphate monoesters in molecular biology. This review provides a comprehensive overview of these structures, their uses, and their computational development. It questions the identification of AlF3^0 and MgF4^2- as tbp species in protein complexes and discusses the relevance of physical organic chemistry and water-based model studies for understanding phosphoryl group transfer in enzymes. It describes two roles for amino acid side-chains that mediate proton transfers during phosphoryl transfer, based on the analysis of protein/MFx structures. First, they deploy hydrogen bonding to neutral oxygen nucleophiles so as to orientate them for correct orbital overlap with the electrophilic phosphorus center. Second, they behave as classical general acid/base catalysts.
D2.2.5 MineSet™
MineSet™ is a commercial data mining product from Silicon Graphics. It provides an interactive platform for data mining, integrating three powerful technologies: database and file access, analytical data mining engines, and data visualization. MineSet supports the knowledge discovery process from data access and preparation through iterative analysis and visualization to deployment. MineSet uses a client-server architecture for scalability and support of large data. The data access component provides a rich set of transformations that can be used to process stored data into forms appropriate for visualization and analytical mining. MineSet's 2D and 3D visualization capabilities allow direct data visualization for exploratory analysis. The analytical mining algorithms create models that can be viewed using visualization tools specialized for the learned models or deployed as part of a larger system. Third-party vendors can interface to the MineSet tools for model deployment and for integration with other packages.
Pruning Decision Trees with Misclassification Costs
decision tree classifiers in two learning situations: minimizing loss and probability estimation. In addition to the two most common methods for error minimization, CART's cost-complexity pruning and C4.5's error-based pruning, we study the extension of cost-complexity pruning to loss and two pruning variants based on Laplace corrections. We perform an empirical comparison of these methods and evaluate them with respect to the following three criteria: loss, mean-squared-error (MSE), and log-loss. We provide a bias-variance decomposition of the MSE to show how pruning affects the bias and variance. We found that applying the Laplace correction to estimate the probability distributions at the leaves was beneficial to all pruning methods, both for loss minimization and for estimating probabilities. Unlike in error minimization, and somewhat surprisingly, performing no pruning led to results that were on par with other methods in terms of the evaluation criteria. The main advantage of pruning was in the reduction of the decision tree size, sometimes by a factor of 10. While no method dominated others on all datasets, even for the same domain different pruning mechanisms are better for different loss matrices. We show this last result using Receiver Operating Characteristics (ROC) curves.
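The Laplace correction and loss-based prediction mentioned above can be sketched as follows. This is a minimal illustration assuming the standard form of the correction, (n_c + 1) / (N + k) for k classes, and a loss matrix where loss[i][j] is the cost of predicting class j when the true class is i. The leaf counts and loss values are made up for the example and are not taken from the paper.

```python
def laplace_probs(counts):
    """Laplace-corrected class probabilities at a leaf: (n_c + 1) / (N + k)."""
    k = len(counts)
    n = sum(counts)
    return [(c + 1) / (n + k) for c in counts]


def raw_probs(counts):
    """Uncorrected relative frequencies at a leaf."""
    n = sum(counts)
    return [c / n for c in counts]


def min_loss_class(probs, loss):
    """Return the class index with the smallest expected loss.

    loss[i][j] is the cost of predicting j when the true class is i.
    """
    n_classes = len(probs)
    expected = [
        sum(probs[i] * loss[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)


# Hypothetical leaf with 3 training instances of class 0 and none of class 1.
counts = [3, 0]
# Hypothetical loss matrix: missing a true class 1 costs 10, a false alarm costs 1.
loss = [[0, 1],
        [10, 0]]

print(laplace_probs(counts))                        # smoothed estimate
print(min_loss_class(raw_probs(counts), loss))      # decision from raw frequencies
print(min_loss_class(laplace_probs(counts), loss))  # decision after correction
```

In this made-up case the raw frequencies give class 1 zero probability, so the leaf predicts class 0, while the smoothed estimate leaves enough mass on class 1 for the asymmetric loss to flip the decision. That shows one way probability estimates at the leaves interact with a loss matrix.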